Lesson 7: Intro to R

Ng Yen Ngee https://www.linkedin.com/in/ng-yen-ngee/
06-26-2021

Introduction to in-class exercise 7

As part of the Visual Analytics course at SMU, prof teaches data visualization in R using R Markdown. This post serves both as my in-class exercise for the lesson and as my notes.

Note to prof, if he is here: I have reorganized the in-class exercise in a way that makes sense to me, so it may not follow the step-by-step process prof went through in class. I have also added items that I had kept as haphazard notes on my desktop. Hope this is alright!

Load packages

Below is a shortcut for loading all the packages in one go. We add whichever packages we want to the packages vector; the loop installs any package that is missing and then loads it. This is an alternative to loading the packages line by line.

packages <- c('DT', 'ggiraph', 'plotly', 'tidyverse' )

for (p in packages) {
  # install the package first if it cannot be loaded
  if (!require(p, character.only = TRUE)) {
    install.packages(p)
  }
  library(p, character.only = TRUE)
}
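Before running the loop, it can be handy to see which packages would actually be installed. Here is a minimal sketch using only base R; missing_packages() is a helper name I made up for illustration, it is not part of any package.

```r
# Packages used in this post (same vector as in the loop above).
packages <- c("DT", "ggiraph", "plotly", "tidyverse")

# Hypothetical helper: return the packages that are not installed yet.
# installed.packages() lists every package found in the library paths.
missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

to_install <- missing_packages(packages)
if (length(to_install) > 0) {
  message("Not installed yet: ", paste(to_install, collapse = ", "))
} else {
  message("All packages are already installed.")
}
```

Nothing is installed or loaded here; the sketch only reports what is missing.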

Load data

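We use the read_csv() function from readr, one of the tidyverse packages, to read the CSV file. For an Excel file, we can use read_excel() from the readxl package, which is installed together with tidyverse but has to be loaded separately with library(readxl).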

exam_data <- read_csv("data/Exam_data.csv")
summary(exam_data)
      ID               CLASS              GENDER         
 Length:322         Length:322         Length:322        
 Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character  
                                                         
                                                         
                                                         
     RACE              ENGLISH          MATHS          SCIENCE     
 Length:322         Min.   :21.00   Min.   : 9.00   Min.   :15.00  
 Class :character   1st Qu.:59.00   1st Qu.:58.00   1st Qu.:49.25  
 Mode  :character   Median :70.00   Median :74.00   Median :65.00  
                    Mean   :67.18   Mean   :69.33   Mean   :61.16  
                    3rd Qu.:78.00   3rd Qu.:85.00   3rd Qu.:74.75  
                    Max.   :96.00   Max.   :99.00   Max.   :96.00  

The data are the year-end examination grades of a cohort of Primary 3 students. Each student is identified by ID and has the attributes CLASS, GENDER and RACE, along with scores in ENGLISH, MATHS and SCIENCE.
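In the summary above, the character columns only report their length. A small sketch of how converting them to factors makes summary() show counts per category instead. The toy_exam data frame is made up for illustration (the class and gender labels are my assumptions, not values taken from Exam_data.csv).

```r
# Made-up stand-in for exam_data; values are illustrative only.
toy_exam <- data.frame(
  ID     = c("S01", "S02", "S03", "S04"),
  CLASS  = c("3A", "3A", "3B", "3B"),
  GENDER = c("Female", "Male", "Female", "Male"),
  MATHS  = c(72, 55, 88, 40),
  stringsAsFactors = FALSE
)

# Convert the categorical columns from character to factor.
toy_exam$CLASS  <- factor(toy_exam$CLASS)
toy_exam$GENDER <- factor(toy_exam$GENDER)

# summary() now shows counts per level for the factor columns.
summary(toy_exam)

# Cross-tabulate two categorical columns.
table(toy_exam$CLASS, toy_exam$GENDER)
```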

Simple visualizations using ggplot2

ggplot2 provides a systematic way of creating visualizations: a plot is built up layer by layer, starting with the data, then the aesthetics, then the geometrics.

To go through each layer, we use a simple example: building a histogram that shows the distribution of MATHS scores.

Data

# we can use this format 
ggplot(data = exam_data)
# or this format 
exam_data %>%
  ggplot()

Both give a blank canvas. The second format makes it easy to manipulate the data with dplyr on the fly just for this graph, e.g. using the filter function or creating a temporary column, while keeping the original table intact. Prof uses the first format, but my preference is to use the second format throughout the rest of the code.
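To illustrate the flexibility of the piped format, here is a rough sketch that filters and adds a temporary column just for one plot. The toy_exam data frame, the "3A" class label and the pass mark of 50 are all assumptions for illustration.

```r
library(dplyr)
library(ggplot2)

# Made-up stand-in for exam_data; values are illustrative only.
toy_exam <- data.frame(
  ID    = c("S01", "S02", "S03", "S04", "S05", "S06"),
  CLASS = c("3A", "3A", "3A", "3B", "3B", "3B"),
  MATHS = c(72, 45, 88, 40, 67, 91)
)

p <- toy_exam %>%
  filter(CLASS == "3A") %>%          # keep one class, for this plot only
  mutate(PASS = MATHS >= 50) %>%     # temporary column, assumed pass mark
  ggplot(aes(x = MATHS, fill = PASS)) +
  geom_histogram(bins = 5)

p

# The original table is untouched: no PASS column, still 6 rows.
names(toy_exam)
nrow(toy_exam)
```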

Aesthetics

Aesthetics define how we map attributes of the data to visual characteristics, e.g. axis position, colour, size, shape and transparency.

As we want to create a histogram of MATHS scores, we map MATHS to x.

exam_data %>%
  ggplot(aes(x=MATHS))

We can now see the MATHS score on the x-axis, but nothing is drawn yet because we have not added a geometric.

Geometrics

Next we choose the kind of plot we want by adding a geom layer.

# for a normal bar plot 
exam_data %>%
  ggplot(aes(x=MATHS)) +
  geom_bar()
# e.g. a histogram split by GENDER using fill
exam_data %>%
  ggplot(aes(x=MATHS, fill=GENDER)) +
  geom_histogram()
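One thing I find easy to mix up: fill = GENDER above sits inside aes(), so it is mapped to a data column and gets a legend. A fixed colour goes outside aes(), as an argument of the geom. A small sketch of the difference, using a made-up toy_exam data frame:

```r
library(ggplot2)

# Made-up stand-in for exam_data; values are illustrative only.
toy_exam <- data.frame(
  MATHS   = c(72, 45, 88, 40, 67, 91),
  ENGLISH = c(65, 50, 80, 55, 70, 85),
  GENDER  = c("Female", "Male", "Female", "Male", "Female", "Male")
)

# Mapped: colour depends on the GENDER column, and a legend is drawn.
p_mapped <- ggplot(toy_exam, aes(x = MATHS, y = ENGLISH, colour = GENDER)) +
  geom_point(size = 3)

# Set: every point gets the same fixed colour, and there is no legend.
p_fixed <- ggplot(toy_exam, aes(x = MATHS, y = ENGLISH)) +
  geom_point(colour = "steelblue", size = 3)

p_mapped
p_fixed
```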

Below are the common geometric types and some simple customizations that I know of and am likely to use often, as a cheat sheet for myself:

Common Geometric Types

geom_point()

geom_line()

geom_col()

geom_bar()

geom_histogram()

exam_data %>%
  ggplot(aes(x=MATHS)) +
  geom_histogram(bins = 20, 
                 color = "black", 
                 fill = "light blue")
exam_data %>%
  ggplot(aes(x=MATHS, fill=GENDER)) +
  geom_histogram(bins = 20, 
                 color = "grey30")

geom_dotplot()

exam_data %>%
  ggplot(aes(x=MATHS, fill=RACE)) +
  geom_dotplot(binwidth = 2.5, dotsize = 0.5)

combine geometrics

The sequence is sometimes important, because layers are drawn in the order they are added. E.g. below, the boxplot is drawn first, so it sits underneath the scatter points.
exam_data %>%
  ggplot(aes(x=GENDER, y=MATHS)) +
  geom_boxplot() + 
  geom_point(position='jitter', size=0.5)
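One caveat with position='jitter': by default it shifts the points both horizontally and vertically, so the plotted MATHS values are slightly off. A possible variation, sketched on made-up data, is to jitter only sideways with geom_jitter() and hide the boxplot's own outlier dots so they are not drawn twice.

```r
library(ggplot2)

# Made-up stand-in for exam_data; values are illustrative only.
toy_exam <- data.frame(
  GENDER = rep(c("Female", "Male"), each = 5),
  MATHS  = c(72, 45, 88, 60, 67, 91, 40, 55, 78, 83)
)

p <- ggplot(toy_exam, aes(x = GENDER, y = MATHS)) +
  geom_boxplot(outlier.shape = NA) +                  # hide boxplot outliers
  geom_jitter(width = 0.2, height = 0, size = 0.5)    # jitter sideways only

p
```

With height = 0, each point keeps its true MATHS value on the y-axis.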

Interactivity with R: ggiraph

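Difference between ggplot2 and ggiraph: we use the _interactive version of the geom (here geom_dotplot_interactive() instead of geom_dotplot()), which accepts the extra aesthetics tooltip and data_id, and then pass the plot object to girafe() to render it as an interactive graphic.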

p <- exam_data %>%
  ggplot(aes(x=MATHS)) +
  geom_dotplot_interactive(aes(tooltip = CLASS, data_id = CLASS), 
                           stackgroups=TRUE, 
                           binwidth=1,
                           method='histodot') + 
  scale_y_continuous(NULL, breaks=NULL)

girafe(
  ggobj=p,
  width_svg=6,
  height_svg = 6*0.618
)
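girafe() also takes an options list to tweak the interactive behaviour. Below is a rough sketch with a scatter plot on made-up data; the CSS values are just my choice for illustration, not something from class.

```r
library(ggplot2)
library(ggiraph)

# Made-up stand-in for exam_data; values are illustrative only.
toy_exam <- data.frame(
  ID      = c("S01", "S02", "S03", "S04"),
  CLASS   = c("3A", "3A", "3B", "3B"),
  MATHS   = c(72, 45, 88, 40),
  ENGLISH = c(65, 50, 80, 55)
)

p <- ggplot(toy_exam, aes(x = MATHS, y = ENGLISH)) +
  geom_point_interactive(aes(tooltip = ID, data_id = CLASS), size = 3)

g <- girafe(
  ggobj = p,
  width_svg = 6,
  height_svg = 6 * 0.618,
  options = list(
    opts_hover(css = "fill:orange;stroke:black;"),  # style of hovered points
    opts_tooltip(opacity = 0.9)                     # tooltip transparency
  )
)

g
```

Because data_id is CLASS, hovering over one point highlights all points from the same class.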

Interactivity with Plotly

Plotly offers more interactivity options than ggiraph with less ‘coding’. The panel at the top right-hand corner of each plot shows the interactive functions available (e.g. zoom, pan, select), which can help us with our analysis.

basic layout

exam_data %>% 
  plot_ly(x = ~MATHS, 
          y = ~ENGLISH)
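When no trace type is given, plotly picks one for us and prints a message saying so. A minimal sketch of stating the type and mode explicitly, on made-up data:

```r
library(plotly)

# Made-up stand-in for exam_data; values are illustrative only.
toy_exam <- data.frame(
  MATHS   = c(72, 45, 88, 40),
  ENGLISH = c(65, 50, 80, 55)
)

fig <- plot_ly(
  toy_exam,
  x = ~MATHS,
  y = ~ENGLISH,
  type = "scatter",   # trace type
  mode = "markers"    # draw points only, no lines
)

fig
```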

adding colour

The legend interaction in plotly is different from Tableau: when we click on a category in the legend, it ‘deselects’ (hides) that category rather than selecting it.

exam_data %>% 
  plot_ly(x = ~MATHS, 
          y = ~ENGLISH, 
          color = ~RACE)

We can also change the colour palette/scheme.

exam_data %>% 
  plot_ly(x = ~MATHS, 
          y = ~ENGLISH, 
          color = ~RACE,
          colors = "Set1")
pal <- c("red", "blue", "green", "purple")
exam_data %>% 
  plot_ly(x = ~MATHS, 
          y = ~ENGLISH, 
          color = ~RACE,
          colors = pal)

customizing tooltip

exam_data %>% 
  plot_ly(x = ~MATHS, 
          y = ~ENGLISH, 
          text = ~paste("Student ID", ID,
                        "<br>Class:", CLASS),
          color = ~RACE,
          colors = "Set1")
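By default the custom text is shown together with the x and y values. If we want the tooltip to show only our own text, one option is hoverinfo = "text". A small sketch on made-up data:

```r
library(plotly)

# Made-up stand-in for exam_data; values are illustrative only.
toy_exam <- data.frame(
  ID      = c("S01", "S02", "S03", "S04"),
  CLASS   = c("3A", "3A", "3B", "3B"),
  MATHS   = c(72, 45, 88, 40),
  ENGLISH = c(65, 50, 80, 55)
)

fig <- plot_ly(
  toy_exam,
  x = ~MATHS,
  y = ~ENGLISH,
  type = "scatter",
  mode = "markers",
  text = ~paste0("Student ID: ", ID, "<br>Class: ", CLASS),
  hoverinfo = "text"   # show only the custom text on hover
)

fig
```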

adding layout

This is where we can, for example, add a title and fix the range of each axis:

exam_data %>% 
  plot_ly(x = ~MATHS, 
          y = ~ENGLISH, 
          text = ~paste("Student ID", ID,
                        "<br>Class:", CLASS),
          color = ~RACE,
          colors = "Set1") %>%
  layout(title = 'English Score versus Maths Score', 
         xaxis = list(range=c(0,100)), 
         yaxis = list(range=c(0, 100))
         )
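The axis lists passed to layout() can carry more than the range. A rough sketch that also sets axis titles, on made-up data (the title wording is my own):

```r
library(plotly)

# Made-up stand-in for exam_data; values are illustrative only.
toy_exam <- data.frame(
  MATHS   = c(72, 45, 88, 40),
  ENGLISH = c(65, 50, 80, 55)
)

fig <- plot_ly(toy_exam, x = ~MATHS, y = ~ENGLISH,
               type = "scatter", mode = "markers") %>%
  layout(
    title = "English Score versus Maths Score",
    xaxis = list(title = "Maths score", range = c(0, 100)),
    yaxis = list(title = "English score", range = c(0, 100))
  )

fig
```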

using ggplotly (keeping ggplot and wrapping with plotly)

We first save the ggplot as an object, then use ggplotly to ‘wrap’ it. Note: ggplot2 features are sometimes not 100% compatible with ggplotly, so check the converted plot carefully.

p <-ggplot(data =  exam_data, aes(x = MATHS, y = ENGLISH )) + 
  geom_point(size = 1) + 
  coord_cartesian(xlim=c(0,100), 
                  ylim=c(0,100))

ggplotly(p)
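To control what the tooltip shows after wrapping, a common pattern is to add a text aesthetic in ggplot and tell ggplotly to use it. A minimal sketch on made-up data:

```r
library(ggplot2)
library(plotly)

# Made-up stand-in for exam_data; values are illustrative only.
toy_exam <- data.frame(
  ID      = c("S01", "S02", "S03", "S04"),
  MATHS   = c(72, 45, 88, 40),
  ENGLISH = c(65, 50, 80, 55)
)

# 'text' is not used by geom_point itself; ggplotly picks it up for the tooltip.
p <- ggplot(toy_exam,
            aes(x = MATHS, y = ENGLISH, text = paste("Student ID:", ID))) +
  geom_point(size = 1) +
  coord_cartesian(xlim = c(0, 100), ylim = c(0, 100))

fig <- ggplotly(p, tooltip = "text")

fig
```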

Using Subplots

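subplot() places several plots side by side. Wrapping the data in highlight_key() first makes the plots share the same data, so points selected in one plot are highlighted in the other.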

d <- highlight_key(exam_data)

p1 <-ggplot(data =  d, aes(x = MATHS, y = ENGLISH )) + 
  geom_point(size = 1) + 
  coord_cartesian(xlim=c(0,100), 
                  ylim=c(0,100))

p2 <-ggplot(data =  d, aes(x = MATHS, y = SCIENCE)) + 
  geom_point(size = 1) + 
  coord_cartesian(xlim=c(0,100), 
                  ylim=c(0,100))

subplot(ggplotly(p1), ggplotly(p2))
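By default subplot() drops the axis titles, which makes it hard to tell the two panels apart. A possible variation, sketched on made-up data, keeps the titles and uses highlight() to link the panels by box or lasso selection; the margin value is just my choice.

```r
library(ggplot2)
library(plotly)

# Made-up stand-in for exam_data; values are illustrative only.
toy_exam <- data.frame(
  MATHS   = c(72, 45, 88, 40),
  ENGLISH = c(65, 50, 80, 55),
  SCIENCE = c(70, 48, 90, 35)
)

# Shared data so that both panels respond to the same selection.
d <- highlight_key(toy_exam)

p1 <- ggplot(d, aes(x = MATHS, y = ENGLISH)) + geom_point(size = 1)
p2 <- ggplot(d, aes(x = MATHS, y = SCIENCE)) + geom_point(size = 1)

fig <- subplot(ggplotly(p1), ggplotly(p2),
               titleX = TRUE, titleY = TRUE,   # keep the axis titles
               margin = 0.06) %>%              # gap between the panels
  highlight(on = "plotly_selected", off = "plotly_deselect")

fig
```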